Record batch machine command timings#1236
Conversation
There was a problem hiding this comment.
Pull request overview
This PR extends the AKS Machine scale-out path to capture per-batch BatchPutMachine timing metrics and emit them as operation metadata, with accompanying test updates to validate metric capture and metadata emission.
Changes:
- Record per-batch start/end timestamps and execution duration in
_create_batch_machines. - Emit
batch_command_execution_timesasscale_machineoperation metadata whenuse_batch_api=True. - Update machine-client tests to cover timing metric capture and metadata emission.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| modules/python/clients/aks_machine_client.py | Captures per-batch timing metrics and emits them as scale_machine operation metadata. |
| modules/python/tests/test_aks_machine_client.py | Extends tests to assert timing metrics are captured and metadata is emitted for the batch path. |
Comments suppressed due to low confidence (1)
modules/python/clients/aks_machine_client.py:921
execution_time_secondsis derived fromdatetime.now()deltas, which can go negative or be skewed if the system clock adjusts (e.g., NTP). For durations, use a monotonic timer (time.perf_counter()/time.monotonic()) and keepdatetimeonly for the wall-clock timestamps.
start_time = datetime.now(timezone.utc)
self._make_batch_request(
"PUT",
url,
body,
Code reviewNo issues found. Checked for bugs and AGENTS.md compliance. |
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |
This reverts commit a21a0d1.
| execution_time_seconds = (end_time - start_time).total_seconds() | ||
| request.batch_command_execution_times[first_machine_name] = { | ||
| "start_time": start_time.strftime("%Y-%m-%dT%H:%M:%SZ"), | ||
| "end_time": end_time.strftime("%Y-%m-%dT%H:%M:%SZ"), | ||
| "execution_time_seconds": execution_time_seconds, | ||
| "total_machines_in_batch": len(chunk), | ||
| } |
There was a problem hiding this comment.
Should record batch timing even when the BatchPutMachine request fails.
Right now the metric is written only after _make_batch_request() returns successfully. If it raises after retries / timeout / non-2xx, _scale_machine_batch() catches that worker failure and returns [], so the operation only reports landed X/Y machines and the failed batch has no timing entry. That makes the new metric biased toward successful batches and hides the slow/failing batch we most need for debugging.
Could we wrap the request in try/finally so every batch records elapsed time, and include a small status/error field?
Summary
Tests